4. Main Analysis (Exploratory Data Analysis)
Part A: Price Distribution, Room Type Distribution, and Pet Policy
A1. Overview
In part A of the Main Analysis section, we start by looking at some patterns concerning the price and popularity of the listings in New York City.
We first look at a quick summary table that reveals information such as number of listings and number of reviews in each borough, as well as average price per night.
Next, we go into more details and find the top 50 most popular neighbourhoods by number of reviews, and the top 50 most expensive neighbourhoods by average price per night.
listing_by_borough <- listings_short %>% group_by(neighbourhood_group_cleansed) %>% summarize(count = n(),avg_price = mean(price),ttl_num_rev = sum(number_of_reviews))
listing_by_borough
## # A tibble: 5 x 4
## neighbourhood_group_cleansed count avg_price ttl_num_rev
## <chr> <int> <dbl> <int>
## 1 Bronx 879 85.8 16154
## 2 Brooklyn 20445 117 359156
## 3 Manhattan 22153 183 417541
## 4 Queens 5079 95.9 103206
## 5 Staten Island 296 127 6176
most_popular_neighbourhood <- listings_short %>% group_by(neighbourhood_group_cleansed,neighbourhood_cleansed) %>% summarize(count = n(),avg_price = mean(price),ttl_num_rev = sum(number_of_reviews)) %>% arrange(desc(ttl_num_rev))
most_popular_neighbourhood<-head(most_popular_neighbourhood,50)
most_expensive_neighbourhood <- listings_short %>% group_by(neighbourhood_group_cleansed,neighbourhood_cleansed) %>% summarize(count = n(),avg_price = mean(price),ttl_num_rev = sum(number_of_reviews)) %>% arrange(desc(avg_price))
most_expensive_neighbourhood<-head(most_expensive_neighbourhood,50)
ggplot(most_popular_neighbourhood, aes(reorder(neighbourhood_cleansed,ttl_num_rev), ttl_num_rev,fill=neighbourhood_group_cleansed)) + geom_col() + coord_flip()+
ggtitle("Top 50 Most Reviewed Neighbourhoods") +
ylab("Number of Reviews") +
xlab("Neighbourhoods") +
labs(fill = 'Boroughs') +
scale_fill_viridis(discrete=TRUE)
ggsave('images/most_reviewed.png')
ggplot(most_expensive_neighbourhood, aes(avg_price,reorder(neighbourhood_cleansed,avg_price))) + geom_point(aes(colour=neighbourhood_group_cleansed)) +
ggtitle("Top 50 Most Expensive Neighbourhoods") +
xlab("Average Price") +
ylab("Neighbourhoods") +
labs(colour = 'Boroughs') +
scale_color_viridis(discrete = TRUE)
From the summary table we see that Manhattan (22153) and Brooklyn (20445) have the most listings. On average, Manhattan is the most expensive ($183 per night), followed by Staten Island ($127 per night), Queens ($96 per night), and Bronx ($86 per night).
Top five most reviewed neighbourhoods are Bedford-Stuyvesant (Brooklyn), Williamsburg (Brooklyn), Harlem (Manhattan), Hell’s Kitchen (Manhattan), and East Village (Manhattan).
Top five most expensive neighbourhoods are Fort Wadsworth (Staten Island), Westerleigh (Staten Island), Lighthouse Hill (Staten Island), Woodrow (Staten Island), and Tribeca (Manhattan). Although Staten Island doesn’t have many lisitngs, top four most expensive neighbourhood by average price are all in Staten Island.
A2. Price Distribution
We then want to put more focus on the prices and draw up two sets of histrogram graphs to learn about general price trend. But before doing that, we first want to look at outliers and remove some of them to reduce skewness.
Outliers
boxplot(listings_short$price,main='Price Boxplot', ylab = 'Price per Night')
hist(listings_short$price, main = 'Price Histrogram',xlab='Price per Night')
summary(listings_short$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 65.0 100.0 144.3 170.0 10000.0
Here we looked at the price per night range of all the listings, trying to identify outliers. We found that the vast majority of listings have a prive lower than $2000. In order to draw informative histrograms, we only kept observations with a price lower than $750.
Price Histrograms
listing_by_borough <- listings_short %>% group_by(neighbourhood_group_cleansed) %>% summarize(count = n(),avg_price = mean(price),avg_num_rev = mean(number_of_reviews))
listing_leq_750 <- listings_short %>% filter(price<=750)
ggplot(listing_leq_750, aes(price)) +
geom_histogram(color = 'black', fill = 'gray') +
ggtitle("Listing Price Distribution") +
xlab("Price per Night") +
ylab("Number of Listings")
ggplot(listing_leq_750, aes(price)) +
geom_histogram(color='black',size=0.2,fill='gray') +
facet_wrap(~neighbourhood_group_cleansed)+
ggtitle("Listing Price Distribution by Borough") +
xlab("Price per Night") +
ylab("Number of Listings")
ggsave('images/price_by_borough.png')
Here we want to explore the distribution of listing prices. We first plotted an overall histrogram and learned that the listing prices are still highly right skewed even after cutting out outliers. Most listings have prices under $200 per night, and the most common prices are between $50 to $100. Considering the average prices of hotels in New York City is somewhere around $200 per night, Airbnb maybe a more economical choice.
We then faceted on boroughs, and see that price distribution in Brooklyn is even more highly skewed, whereas Manhattan’s price distribution is a little more spread out.
More specificly, Brooklyn tends to have a very high number of listings clustered in the price range of $50 to $100. Overall price distribution in Brooklyn is highly concentrated on the cheaper side– over 90% of listings are below $200 per night.
Manhattan on the other hand has a more even distribution. The most frequent prices are still between $50 - $100 but with a much lower percentage. Average prices in Manhattan are high due to longer and fatter tails in the distribution. This tells us that Airbnb in Brooklyn in general are cheaper than the overall distribution in the last graph, while Airbnb in Manhattan is a little bit more expensive.
Consider our findings from the introduction section - the fact that the top two most popular neighbourhoods are both located in Brooklyn maybe partially due to the lower price trends in this Borough.
A3. Room Types
We noticed that there are three types of room available across Airbnb postings - private room, shared room, or entire home or apartment. We are interested to see if the distribution of these room types are correlated with boroughs and we decide to use mosaic plot to explore that relationship.
#colorblind-friendly palette
cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
p1<-mosaic(main = "Room Type by Boroughs",room_type ~ neighbourhood_group_cleansed, listings_short,gp = gpar(fill =cbbPalette[4:6], col = "white"),labeling = labeling_border(varnames=c(FALSE,FALSE),rot_labels = c(10,0,0,10)))
p1
## room_type Entire home/apt Private room Shared room
## neighbourhood_group_cleansed
## Bronx 259 583 37
## Brooklyn 9054 10969 422
## Manhattan 12862 8725 566
## Queens 1786 3086 207
## Staten Island 136 156 4
png('images/room_type.png')
mosaic(main = "Room Type by Boroughs",room_type ~ neighbourhood_group_cleansed, listings_short,gp = gpar(fill =cbbPalette[4:6], col = "white"),labeling = labeling_border(varnames=c(FALSE,FALSE),rot_labels = c(10,0,0,10)))
dev.off()
## quartz_off_screen
## 2
From this plot we see that room types are somewhat correlated with boroughs. More specifically, Manhattan hosts are more likely to list entire home/apt, while Queens and Brooklyn hosts are more likely to list private room within a house or apartment. Since Bronx and Staten Island have very small number of listings, their correlation with room types is nearly neglegible in the bigger picture. However, a further study on room types focused on these two boroughs may bring insteresting insights. Shared rooms are very rare across all boroughs.
Room Types + Price
Since we previously looked at price distribution, here we combine price distribution with room types for further insights.
ggplot(listing_leq_750, aes(price, fill = room_type)) +
geom_histogram(color='black',size=0.2) +
facet_wrap(~neighbourhood_group_cleansed)+
ggtitle("Listing Price by Borough and Room Type") +
xlab("Price per Night")+
ylab("Number of Listings")+
labs(fill = 'Room Type') +
scale_fill_viridis(discrete = TRUE)
When combine room types and listing prices, we see that as price goes up, room type “Entire home/apt” starts to take a higher percentage, which is consistent with common sense since entire home usually means more space and pricacy, hence higher price. This trend stays true across all boroughs.
A4. Pet Policy
Since some guest may be traveling with their pets, the last section in Part A looks at the percentage of listings that allow pets to stay in their palces.
We first use a mosaic plot to see the pet policy is correlated with boroughs. We then use a grouped bar charts to looked the shares of listings that allow for specific types of pets.
listings_pets <- read_csv("listings_pets.csv", col_types = cols())
ls_pets <- listings_pets %>%
mutate(pets = ifelse(str_detect(amenities,'pets allowed') | str_detect(amenities,'cat') | str_detect(amenities,'dog'),1,0))
ls_pets <- ls_pets %>%
mutate(dog = ifelse(str_detect(amenities,'dog'),1,0))
ls_pets <- ls_pets %>%
mutate(cat = ifelse(str_detect(amenities,'cat'),1,0))
ls_pets <- ls_pets %>%
mutate(Unspecified_Other = ifelse(cat==0 && dog==0,1,0))
ls_pets <- ls_pets %>%
mutate(pet_category = case_when(
cat==1 & dog ==1 ~ 'Cat and Dog',
cat ==1 & dog ==0 ~ 'Cat Only',
cat ==0 & dog==1 ~ 'Dog Only',
cat==0 & dog==0 ~ 'Unspecified'
))
pets_only<-filter(ls_pets,pets==1)
mosaic(main ="Are pets allowed?",pets ~ neighbourhood_group_cleansed, ls_pets, gp = gpar(fill = cbbPalette[2:3], col = "white"), labeling = labeling_border(varnames=c(FALSE,FALSE),rot_labels = c(0,45,30,30)))
ggplot(pets_only, aes(x=neighbourhood_group_cleansed, fill=pet_category)) +
geom_bar(position='dodge') +
ggtitle('Pet Policy by Borough and Pet Type') +
xlab("Neighbourhoods") +
ylab("Number of Listings") +
labs(fill='Pet Category') +
scale_fill_colorblind()
From the mosaic plot we see that pet policy may only be slightly correlated with boroughts. Overall, most listings do not explicitly state that they welcome pets, only a small percentage of listings explicitly express that they are willing to accept pets. This trend is consistent across all boroughs. Brooklyn is slightly more pet friendly, compared to the others.
For the listings that do allow pets, we break them down into whether they specify what type of pets are allowed, by looking at the group bar char graph. Among the listings that allow pets, cats in general take a higher percentage and therefore are porbably more welcomed than dogs, especially in Brooklyn. Also a large number of listings only indicate that they are pet freindly, but do not restrict to specific pet types.
Part B - Text Analytics
In this part, we moved onto text analytics to reveal how listing owners portray their places and visitors describe their experiences and express their feelings after staying at a specific venue of Airbnb. Listing summary and amenities description provide a good clue to for the former question and reviews comments shed insights on the latter question. The use of adjective words are indicative of positive and negative sentiment of visitors’ experiences. Moreover, we also investigated users’ behviours on sharing their experiecens as well as change of user behaviour over time on Airbnb based on the length of reviews in the combination of different NYC boroughs, room-types or rating category.
Text analytics on Airbnb NYC dataset mainly focused on the three areas: listings summary, amenities description and review comments. The frist section is the description of preprocessing, following by the analysis of data quality and the interesting questions to address in each area. Following are the questions we asked to understand user behaviours and experiences on Airbnb:
- Which adjective words appear most frequently in listing summary?
- Are words of summary in each rating category the same?
- What are the most frequent amenities words?
- Which adjective words appear in high-rating or low-rating reviews?
- Is the length of reviews of each room-type or borough different?
- Is the length of review of each room-type or rating category different?
- Does the length of reviews become shorter or longer over time?
Preprocessing
Most of the preprocessing in text analytics was done in python. The python codes can be found in the jupyter notebook TextAnalytics_preprocessing.ipynb (https://github.com/chding12/EDAV_Final_Project/blob/master/TextAnalytics_preprocessing_NLP.ipynb). Originally we have listingsAll.csv and reviewsAll.csv containing airbnb dataset in the past 10 years. After the inital cleaning process (refer to https://github.com/chding12/EDAV_Final_Project/blob/master/clean_data.Rmd) based on listingsAll.csv, we created cleaned_data.csv containing the required columns.
In TextAnalytics_preprocessing.ipynb, 4 steps are executed to complete the preprocessing for text analytics. The dataset was first loaded into pandas dataframe (https://pandas.pydata.org/) and then processed.
Add rating ategory to the cleaned file We created a new column by assigning each listing with a category based their review scores. The distribution of review scores is shown in ratings distribution of the Overview section using dataset from listingsAll.csv and is used as references to create 3 categories of ratings (A, B and C). The new column “rating_categ” was combined to the cleaned data to create CleanedPlus.csv.
Adjective words in summary for visualization In order to plot a wordcloud of adjective words in summary, we created a table of words and its frequencies for each rating category. We first tokenized summary text in each rating category into individual words and did pos tagging to identify the part-of-speech tag of each word. As pos-tagging is case-sensitive, we only turned each words into lowercases later. The finalized data with three columns (word, freq, category) was written to adj_summary.csv, consisting of all the three rating categories. This dataset is later used in the Summary section.
Adjective words in review for visualization We worked on review comments from reviewsAll.csv. (1) The review comments was tokenized into individual words. (2) English stopwords, such as “a”, “the”, etc…, were removed from the words. (3) Post tagging. (4) Find adjective words. (5) Find nouns. (6) Count the number of adjective, noun and total words. After the preprocessing, we kept only required columns for visualization to reviewsPlus.csv. This dataset is used in data quality analysis and in the Review section. Moreover, we counted adjective words in each rating category and stored the data of three columns (word, freq, category) to adj_reviews.csv for wordcloud visualization in the Review section.
Amenities word frequency Key words of amenities description were extracted and counted. The words and frequencies of amenities were written to amenities.csv for the Amenities section.
B1. Overview
Ratings Distribution
library(dplyr)
mydata1 <- ls %>%
dplyr::filter(!is.na(review_scores_rating))
hist(mydata1$review_scores_rating)
The purpose of this plot is to observe the distribution of review scores of each listing to come up with three rating categories. Category A indicates excellent listings with reviews scores from 90 to 100. Category B represent the good listings with reviews scores ranging between 75 to 90. Category C is a group of not-so-good listings where reviews scores are less than 75. It is interesting to notice that the majority of listings have high review scores.
B2. Listing summary
Which adjective words appear most frequently in listing summary?
#install.packages("wordcloud")
#install.packages("RColorBrewer")
library(wordcloud)
library(RColorBrewer)
adj_s_all <- read.csv('adj_summary.csv', stringsAsFactors=FALSE,sep = "#")
adj_s_all$freq <- as.numeric(adj_s_all$freq)
#all
words_to_remove <- c('nan', 'll')
s1 <- adj_s_all %>%
select(c("word", "freq")) %>%
group_by(word) %>%
summarize(freq=sum(freq)) %>%
filter(!word %in% words_to_remove)
wordcloud(words = s1$word, freq = s1$freq,
min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Paired"))
What aspects listing owners think are the most important and attractive to visitors can be shown by how they describe their spaces. As we can see from the wordcloud representing most common adjective words in the listing summary, “private” is the most frequently used words to describe a listing. It corresponds to the nature of Airbnb business model as it is based on the concept of sharing hosts’ personal housing with visitors. The other common words can be divided into two groups. First group, such “beautiful”, “great”, “comfortable”, expresses abstract ideas of the venue to stimulate reader’s imagination, while the second group, such as “new”, “quiet”, “spacious”, “large”, “clean”, gives readers more hints on the physical condition of the place to bring vivid pictures into user’s imagination.
Are words of summary in each rating category the same?
library(ggplot2)
adj_s_all %>%
group_by(category) %>%
top_n(n=10, wt=freq) %>%
ungroup() %>%
ggplot(aes(reorder(word, freq), freq)) +
geom_col(show.legend = FALSE) +
facet_wrap(~category, scales = 'free') +
labs(y = "words frequency",
X="most frequent words") +
coord_flip()
When listing owners describe their place, they would tend not only to portray their spaces to impress visitors but also be authentic to the real condition to certain extent. It turns out from our data exploration that these subtlties can be observed from the use of adjective words in their listing summary. We studied the summary from each rating groups and found interesting different usage of adjective words.
Top 10 common adjective words are very similar among the three rating categories like “great”, “private”, “beautiful”. Nevertheless, each group has one or two words different from the other group. Group A has the word “new”, which does not appear in group B or C. Group A and Group B both has the word “comfortable”, which is not present in group C. Both group B and C has the word “good”, which is not in the top list in group A. In group C, without “new” and “comfortable”, it has “available” instead. In conclusion, group A has “new” and “comfortable”, while group B has “comfortable” and “good”, and group C has “good” and “available”. The meaning of the three combinations already reveal the real condition of the listings to certain extent, which is confirmed by the rating groups of review scores.
B3. Amenities
What are the most frequent amenities words?
amenities <- read.csv("amenities.csv", stringsAsFactors=FALSE,sep = "#")
amenities <- amenities[!startsWith(amenities$word, "translation missing"),]
wordcloud(words = amenities$word, freq = amenities$freq, min.freq = 1,
max.words = 200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Apart from listing summary, amenities provide a clue how owners descrbe their spaces to make it more attractive to potential to visitors. Word frequency of amenities in listings are shown in the above wordcloud figure. When it comes to amenities, listing owners mostly focus on three areas basics, technology and safety.
- Basics: “essentials”, “hangers”, “shampoo”, “hair dryer”
- General Facility: “kitchen”, “heating”, “air conditioning”
- Technology: ’Wifi“,”laptop friendly workspace“,”internet"
- Safety: “smoke detector”, “carbon monoxide detector”, “fire extinguisher”.
Apparently, “kitchen”, “heating” and “air conditioning”, which provides comfort to a venue, are very often mentioned as part of the amenities provided by the listing. Safety is considered very important by the owners as those three words are comparably large as well. In terms of technology, “Wifi”, providing our connection to the world, is as essential as elements providing pyhsical comfort such as “kitchen”" or “heating”. Commonly mentioned items in basics include “hangers”, “hair dryer” and “shampoo”. These things are comparably easier to provide and it is nice to have them when visitors temporily stay at a place.
B4. Reviews
Which words appear in high-rating or low-rating reviews?
library(tidytext)
library(reshape2)
library(wordcloud)
adj_r <- read.csv("adj_review.csv", stringsAsFactors=FALSE,sep = "#")
adj_r_A <- adj_r %>%
filter(category == 'A') %>%
top_n(n=200, wt=freq)
adj_r_C <- adj_r %>%
filter(category == 'C') %>%
top_n(n=200, wt=freq)
# The palette with black:
cbbPalette_s <- c("#0072B2", "#D55E00", "#CC79A7")
# words in high-rating reviews
df_A <- adj_r_A %>%
inner_join(get_sentiments("bing"))
wordcloud::wordcloud(df_A$word, df_A$freq,
ordered.colors=TRUE, max.words = 200,
colors=cbbPalette_s[factor(df_A$sentiment)])
# words in low-rating reviews
df_C <- adj_r_C %>%
inner_join(get_sentiments("bing"))
wordcloud(df_C$word, df_C$freq,
ordered.colors=TRUE, max.words = 200,
colors=cbbPalette_s[factor(df_C$sentiment)])
Visitors’ experiences after staying in a listing can be revealed by sentiment analysis on the words in their reviews. We first did sentiment analysis on the adjective words of the reivews to identify words with positive or negative sentiments. As it is shown in the above two wordclouds, mostly positive words appeared in the reviews of listings with good ratings (Group A). As we moved to group C, although we could still see a few positive words in the reviews, more negative words appeared. Moreover, some words, such as “wonderful”, “super”, and “fantastic”, to express exceptionally good feelings or opinions became less obvious or disappeared in the group C wordcloud. The choice of color is vision deficiency friendly.
Is the length of reviews of each room-type or borough different?
rp <- reviewsPlus[complete.cases(reviewsPlus),]
col_to_select <- c("id", "room_type", "neighbourhood_group_cleansed", 'rating_categ')
sc <- cleanedPlus %>%
select(col_to_select)
gl <- rp %>%
group_by(listing_id) %>%
summarize(
mean_adj=as.integer(mean(number_adj)),
total_adj=sum(number_adj),
mean_noun=as.integer(mean(number_noun)),
total_noun=sum(number_noun),
mean_words=as.integer(mean(number_words)),
total_words=sum(number_words),
review_counts=n(),
perc_adj=total_adj/total_words,
perc_noun=total_noun/total_words,
perc_other=1-(perc_adj+perc_noun)
) %>%
inner_join(sc, by = c("listing_id" = "id"))
Here we created columns required for analysis on reviews. In the following analysis on the length of reviews, we used the average words of reviews to represent this feature.
#unloadNamespace("viridis")
#update.packages("viridis")
library(viridis)
library(dplyr)
library(ggplot2)
r1 <- gl %>%
group_by(neighbourhood_group_cleansed, room_type) %>%
dplyr::summarise(avg_words = mean(mean_words))
# Use position=position_dodge()
p <- ggplot(data=r1, aes(x=neighbourhood_group_cleansed,
y=avg_words, fill=room_type)) +
geom_bar(stat="identity", position=position_dodge()) +
scale_fill_viridis(discrete=TRUE, option="magma")
p
# check the number of reviews of shared room in Staten Island
ss <- gl %>%
filter(room_type == "Shared room") %>%
filter(neighbourhood_group_cleansed == "Staten Island") %>%
count()
ss
## # A tibble: 1 x 1
## n
## <int>
## 1 1
As each borough has its own characteristics and would attract visitors coming to NYC for differnt purposes, we would like to know if the length of reviews would be varied according to room-types in a specific borough. As we can see from the bar chart, there is no correlation of the length of reviews and boroughs. However, people tend to write longer reviews for entire home/apartment than for private room and shared room. It is worth pointing out that shared room on Staten Island has lengthy reviews compared to the other combinations. It might be due to a small number of reviews written for shared room on Staten Island, which is confirmed by finding out that there is onyl one review in the combination of these two categorical variables. The choice of color is vision deficiency friendly.
Is the length of review of each room-type or rating category different?
library(grid)
r2 <- gl %>%
group_by(rating_categ, room_type) %>%
dplyr::summarise(avg_words = mean(mean_words))
# mosaic plot
r2$room_type <- as.factor(r2$room_type)
r2$rating_categ <- as.factor(r2$rating_categ)
r2t <- xtabs(avg_words ~ rating_categ+room_type,
r2)
cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
m2 <- vcd::mosaic(rating_categ~room_type, data=r2t,
gp = gpar(fill = cbbPalette[5:7]))
# bar chart
p <- ggplot(data=r2, aes(x=rating_categ,
y=avg_words, fill=room_type)) +
geom_bar(stat="identity", position=position_dodge())+
scale_fill_viridis(discrete=TRUE, option="magma")+
theme_grey(base_size = 16) +
ggtitle("Review Length in Rating Category") +
xlab("Rating Category") +
ylab("Average Number of Words")
p
ggsave(filename="images/ReviewsRatings.png", plot=p)
The length of reviews is a feature to observe how visitors would describe their good or not-so-good experiences and to see if difference exist among room types. In the mosaic plot, we can see there is an interaction between room type and rating category. While the number of words does not vary much for each rating category for shared room, it changes when people stay in entire home/apartment or private room with good experiences (rating group A or B). This observation reveals an interesing aspect of human behaviours showing that people are willing to put more efforts in terms of review length to talk about their good or wonderful experiences than average ones.
The change of review length is more prominent when we presented it using bar chart stratified by room types within each rating group. There is not difference among room types when people talk about not-so-satisfying experiences as shown in group C. In contrast, people use more words to describe their good and wonderful experiences (as shown in Group A and B) and, similar to the previous observation, the length of reviews increases from shared room, private room to entire home/apartment when visitors talk about their experiences in those difference venues. The choice of color is vision deficiency friendly.
Does the length of reviews become shorter or longer over time?
rpg <- rp %>%
left_join(sc, by = c("listing_id" = "id"))
rpg$date <- as.Date(rpg$date)
rpg$month <- as.Date(cut(rpg$date, breaks = "month"))
rpg$week <- as.Date(cut(rpg$date, breaks = "week"))
monthly changes over years
r3 <- rpg %>%
filter(date > "2014-01-01" & date <"2018-04-01") %>%
group_by(month, room_type) %>%
dplyr::summarise(avg_words = mean(number_words))
p <- ggplot(r3, aes(x=month, y=avg_words)) +
geom_line(aes(color=room_type), size=1) +
scale_color_viridis(discrete=TRUE, option="magma") +
ggtitle("Review Length over Years") +
xlab("Year") +
ylab("Average Number of Words")+
theme_grey(base_size = 16)
p
ggsave(filename="images/ReviewsRoomtype.png", plot=p)
weekly changes in 2017
r3 <- rpg %>%
filter(date > "2017-01-01" & date <"2018-01-31") %>%
group_by(week, room_type) %>%
dplyr::summarise(avg_words = mean(number_words))
p <- ggplot(r3, aes(x=week, y=avg_words)) +
geom_line(aes(color=room_type), size=1)
p + scale_color_viridis(discrete=TRUE, option="magma")
One way to follow the change of users’ behaviour on Airbnb over time is to see the amount of efforts they are willing to share their experiences, suggested by the length of review in the past years. The first time series plot clearly demonstrates that the number of reviews decreases over years and is independent of room types. In general, the number of reviews drops at the beginning of a year after the peak in the previous year. Again, the number peaks around Spring and Summer. It decreases for a few months and reaches a peak at the end of a year corresponding to Christmas holidays. There exists difference of reviews length over time among room types, but the overal ddecreasing pattern is similar. The choice of color is vision deficiency friendly.
The second time series plot provides a quick closer look to see if any pattern exists within in a year when people wirte reviews in 2017. We can see that entire home/apartment and private room has similar patterns while shared room has slightly different one. Although some patterns can be observed in this plot, further exploration are required to get more insights. The choice of color is vision deficiency friendly.
B5. Text Analytics Summary
In summary, we have interesting findings in our text analytics on how listing owners portray their places (adjective words used in summary and description of amenities) and user behaviours in writting reviews (words to express sentiments on accommodation, efforts spent to write reviews based on room types or experiences in the venue, change of review length over time).